Title: Embeddable Flash Memory System for Non-Volatile Storage of Code,
Data and Bit-Streams for Embedded FPGA Configurations.

DESCRIPTION 

Field of the invention 

The present invention relates to an embeddable Flash memory system for
non-volatile storage of code, data and bit-streams for embedded FPGA
configurations.

More specifically, the invention relates to a memory system integrated into a
single chip, together with a microprocessor, and including a modular array
structure comprising N memory blocks.

Prior art 

As is well known in this specific technical field, the continuous size and
price reduction of hand-held digital equipment, together with the demanding
computing performance and low power constraints of consumer
applications, is increasing the need for a technology that combines
high-performance digital CMOS transistors and non-volatile flash memory.

For instance, an efficient power block for a memory device is disclosed in
the article by R. Pelliconi, D. Iezzi, A. Baroni, M. Pasotti, P. L. Rolandi,
"Power efficient charge pump in deep sub micron standard CMOS
technology", Proceedings of 27th ESSCIRC, pp. 100-103, Sept. 2001.

At the same time, rising costs of mask sets and the shorter time-to-market
available for new products are leading to the introduction of systems with
a higher degree of programmability and configurability, such as
systems-on-chip with configurable processors, embedded FPGA and embedded
flash memory.

In this respect, the availability of an advanced embedded flash technology, 
based on NOR architecture, together with innovative IP's, like embedded 
flash macrocells with special features, is a key factor. 

For a better understanding of the present invention, reference is made to
Field Programmable Gate Array (FPGA) technology combining
standard processors with embedded FPGA devices.

These solutions make it possible to configure into the FPGA, at deployment
time, exactly the required peripherals, exploiting temporal re-use by
dynamically reconfiguring the instruction set at run time based on the
currently executed algorithm.

The existing models for designing FPGA/processor interaction can be 
grouped in two main categories: 

- the FPGA is a co-processor communicating with the main processor 
through a system bus or a specific I/O channel; 

- the FPGA is described as a function unit of the processor pipeline. 

The first group includes the GARP processor, known from the article by T.
Callahan, J. Hauser, and J. Wawrzynek entitled "The Garp
architecture and C compiler", IEEE Computer, 33(4): 62-69, April 2000. A
similar architecture is provided by the A-EPIC processor that is disclosed
in the article by S. Palem and S. Talla entitled "Adaptive explicit
parallel instruction computing", Proceedings of the fourth Australasian
Computer Architecture Conference (ACOAC), January 2001.

In both cases the FPGA is addressed via dedicated instructions, moving 
data explicitly to and from the processor. Control hardware is kept to a 
minimum, since no interlocks are needed to avoid hazards, but a 
significant overhead in clock cycles is required to implement 
communication. 

The communication overhead may be considered negligible only when the
number of cycles per execution of the FPGA is relatively high.
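
The trade-off described above can be made concrete with a small calculation.
The following Python sketch is illustrative only: the function name and the
cycle counts are hypothetical and do not come from the cited architectures.
It computes the share of the total cycle count spent on communication for a
short and a long FPGA task.

```python
def overhead_fraction(fpga_cycles: int, comm_cycles: int) -> float:
    """Share of the total cycle count spent moving data to and from the FPGA."""
    if fpga_cycles < 0 or comm_cycles < 0 or fpga_cycles + comm_cycles == 0:
        raise ValueError("cycle counts must be non-negative and not both zero")
    return comm_cycles / (fpga_cycles + comm_cycles)


# Hypothetical numbers: the same 20 transfer cycles around a short and a long task.
short_task = overhead_fraction(fpga_cycles=20, comm_cycles=20)
long_task = overhead_fraction(fpga_cycles=2000, comm_cycles=20)

print(f"short task overhead: {short_task:.1%}")
print(f"long task overhead:  {long_task:.1%}")
```

With these assumed figures the overhead is half of the total for the short
task and below one percent for the long one.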

In the commercial world, FPGA suppliers such as Altera Corporation offer 
digital architectures based on the US Patent No. 5,968,161 to T.J. 
Southgate, "FPGA based configurable CPU additionally including second 
programmable section for implementation of custom hardware support". 

Other suppliers (Xilinx, Triscend) offer chips containing a processor 
embedded on the same silicon IC with embedded FPGA logic. See for 
instance the US Patent 6,467,009 to S.P. Winegarden et al., "Configurable 
Processor System Unit", assigned to Triscend Corporation. 

However, those chips are generally loosely coupled by a high-speed
dedicated bus, performing as two separate execution units rather than
being merged into a single architectural entity. In this manner the FPGA
does not have direct access to the processor memory subsystem, which is
one of the strengths of the academic approaches outlined above.

In the second category (FPGA as a function unit) we find architectures
commercially known as "PRISC", "Chimaera" and "ConCISe".

In all these models, data are read and written directly on the processor
register file, minimizing the overhead due to communication. In most cases, to
minimize control logic and hazard handling and to fit in the processor
pipeline stages, the FPGA is limited to combinatorial logic only, thus
severely limiting the performance boost that can be achieved.

These solutions represent a significant step toward a low-overhead
interface between the two entities. Nevertheless, due to the granularity of
FPGA operations and its hardware-oriented structure, their approach is
still very coarse-grained, reducing the possible parallelism in resource usage
and again introducing hardware issues that are neither familiar nor friendly
to software compilation tools and algorithm developers.

Thus, a relevant drawback of this approach is the memory data
access bottleneck, which often forces long stalls on the FPGA device in order
to fetch into the shared registers enough data to justify its activation.

The technical problem of the present invention is that of providing a new
kind of embeddable memory architecture having functional and structural
features capable of offering significant performance and energy consumption
enhancements with respect to a traditional signal processing device.

Summary of invention 

The invention overcomes the limitations of similar preceding architectures
by relying on an embedded device of a different nature and on a new approach
to the processor/memory interface.

According to a first embodiment of the present invention, said embeddable
Flash memory system includes a modular array structure comprising N
memory blocks, wherein a power block, including charge pumps, is
shared among different flash memory modules through a PMA arbiter in a
multi-bank fashion.

Moreover, the embeddable Flash memory system according to the
invention includes three different access ports, each for a specific
function:

- a code port CP (10) optimized for random access time and the
application system;

- a data port DP (11) allowing an easy way to access and modify
application data; and,

- an FPGA port FP (12) offering serial access for a fast download of bit
streams for embedded FPGA (e-FPGA) configurations.

The features and advantages of the digital architecture according to this
invention will become apparent from the following description of a best
mode for carrying out the invention, given by way of non-limiting example
with reference to the enclosed drawings.

Brief description of the drawings

Figure 1 is a block diagram of a memory architecture for data storage
processing according to the present invention;

Figure 2 is a block diagram of a programming circuit with a gate ramp
slope dependent on the current required by the memory cells under programming;

Figure 3 is a schematic diagram of the programming gate voltage ramp slope
with 128 cells in programming (0.3V/us);

Figure 4 is the real diagram of the voltage ramp slope of Figure 3;

Figure 5 is a block diagram of a sense amplifier;

Figure 6 is a block diagram of a power management block architecture;

Figure 7 is a photographic representation of the memory architecture
according to the present invention.

Detailed description 

With reference to the drawings, generally shown at 1 is an
embeddable Flash memory system for non-volatile storage of code, data
and bit-streams for embedded FPGA configurations realized according to
the present invention.

More specifically, an 8Mb application-specific embeddable flash memory
system is disclosed. The memory system may be integrated into a single
chip together with a microprocessor 2.

This memory architecture 1 includes three different access ports, each for
a specific function:

- a code port 10 (CP), that is optimized for random access time and the
application system; this port 10 may be used as in a usual flash
memory;

- a data port 11 (DP) allowing an easy way to access and modify
application data; and,

- an FPGA port 12 (FP) offering serial access for a fast download of bit
streams for embedded FPGA (e-FPGA) configurations.

A test chip, integrated for performance assessment as well as for design
and built-in self-test methodology validation, will be further presented.
A special automatic programming gate voltage ramp generator
circuit that allows a programming rate of 1 Mbyte/s and an erase time of
200ms has also been introduced, as will be further clarified.

The memory system architecture 1 is schematically shown in Figure 1.
The architecture comprises a modular memory 13 (dotted line) including
charge pumps 14 (Power Block), testability circuits 16 (DFT), a power
management arbiter 15 (PMA) and a customizable array of N independent
2Mb flash memory modules 3.

Depending on the storage requirements and performance, the number of
modules 3 can be varied; in the current non-limiting example the number
of modules has been chosen as N=4.

The modular memory 13 includes (N+2) 128-bit target ports and
implements an N-bank uniform memory.

As previously mentioned, three content-specific ports 10, 11 and 12 are
dedicated to code (CP, 64-bit wide), data (DP, 64-bit) and FPGA bit stream
configurations (FP, 32-bit). A 128-bit sub-system crossbar 15 connects all
the architecture blocks and the eight-bit microprocessor 2.

The main features of such a flash memory system are: charge pump 14
sharing among different flash memory modules 3 through the PMA arbiter
15 in a multi-bank fashion; the use of a small eight-bit
microprocessor 2 to ease memory system test and to add complex
functionalities for data management; and the use of an ADC
(Analog-to-Digital Converter), required by the application, to increase
system self-test capability.

Let us now evaluate in more detail the main features of the inventive
memory architecture.

Each flash memory module 3 has a size of 2Mb, a 128-bit IO data
bus with 40ns access time, resulting in 400Mbyte/s overall throughput,
and a program/erase control unit.

The entire high-voltage generation section is in the power block 10, which is
shared by each of the four 2Mb flash memory modules 3.

A 1 Mbyte/s programming rate with a 128-bit word requires that the
programming charge pumps of the block 10 can supply up to 3mA of
programming current.
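
The figures above imply a time budget per word and an average current per
bit, which the following Python sketch works out. It is a back-of-the-envelope
illustration under two assumptions that are not stated in the text: 1 Mbyte is
taken as 10**6 bytes, and the 3mA is assumed to divide evenly across all 128
bits of a word programmed together.

```python
WORD_BITS = 128                        # word width given in the text
PROGRAM_RATE_BYTES_PER_S = 1_000_000   # 1 Mbyte/s, with 1 Mbyte = 10**6 bytes (assumption)
PUMP_CURRENT_MA = 3.0                  # maximum programming current given in the text

word_bytes = WORD_BITS // 8

# Time available to program one word at the target rate, in microseconds.
word_time_us = word_bytes * 1e6 / PROGRAM_RATE_BYTES_PER_S

# Average current per bit if all bits of the word are programmed at once
# and share the pump current equally (simplifying assumption).
current_per_bit_ua = PUMP_CURRENT_MA * 1000 / WORD_BITS

print(f"time budget per 128-bit word: {word_time_us} us")
print(f"average current per bit:      {current_per_bit_ua} uA")
```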

These charge pumps are usually sized to sustain operations in worst-case
conditions of process and temperature variations with all bits of a word in
programming. This leads to an increase in the charge pump area of more than
130% with respect to typical conditions, when just half of the bits in a
word will be programmed.

This memory architecture further comprises a programming circuit 9 that 
overcomes this problem and is shown in Figure 2. 

As may be appreciated, the memory cells 6 are organized in a memory
matrix with associated row and column decoders. A multiplexer 7, fed by
the output of a voltage regulator 17, biases the memory matrix rows with a
Vread voltage, while a program switch 8, fed by another voltage regulator
18, biases the memory matrix columns with a Vd voltage generated from
a Vpd voltage value supplied by the charge pumps.

Referring now to Figures 3 and 4, after a preliminary program verify
operation, phase A, the programming operation starts in phase B and the
programming circuit 9 will move the gate voltage with a maximum slope
defined by the operational amplifier slew rate, until the memory cells sink
all the available current.

In a third phase C the gate voltage reaches a level that switches on all the
cells in programming and the charge pump output voltage Vpd lowers
from 6V to 5V.

The operational amplifier changes the word-line voltage slope to fix the
voltage Vpd at the value where the charge pump can deliver
current at maximum efficiency.

This current programs the flash cells 6 through the voltage regulator 18,
which keeps the flash cell drain voltage Vd at a fixed value of 4.5V.

In phase C all the cells 6 under programming will see their thresholds
moving at the same rate and the generated programming gate voltage
becomes, to first order, a linear ramp with an optimum slope defined by
the current from the charge pump and by the number of bits in
programming. This enables the memory cells 6 to use all the available current,
modulating the programming gate voltage and, consequently, the
programming speed.
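
The dependence of the phase C ramp slope on pump current and on the number
of bits in programming can be illustrated with a simple first-order model.
The Python sketch below assumes that the slope is proportional to the
available pump current and inversely proportional to the number of cells in
programming. That proportionality is an assumption made for illustration, not
a formula given in this description. The model is calibrated on the single
data point of Figure 3 (128 cells, 0.3V/us, read here as volts per
microsecond).

```python
REF_CELLS = 128            # calibration point from Figure 3
REF_SLOPE_V_PER_US = 0.3   # slope at the calibration point


def ramp_slope_v_per_us(cells_in_programming: int,
                        pump_current_ratio: float = 1.0) -> float:
    """First-order estimate of the phase C gate ramp slope.

    pump_current_ratio is the available pump current relative to the
    current available at the calibration point (hypothetical parameter).
    """
    if cells_in_programming <= 0:
        raise ValueError("at least one cell must be in programming")
    return REF_SLOPE_V_PER_US * pump_current_ratio * REF_CELLS / cells_in_programming


worst_case = ramp_slope_v_per_us(128)  # all bits of the word in programming
typical = ramp_slope_v_per_us(64)      # half of the bits in programming

print(f"128 cells: {worst_case:.2f} V/us")
print(f" 64 cells: {typical:.2f} V/us")
```

Under this assumed model, halving the number of cells in programming doubles
the slope, which is consistent with the statement that the cells always use
all the available current.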

As application data are expected to be frequently modified, the erased and
programmed threshold distributions have been carefully positioned
taking into account this assumption together with reliability and power
consumption considerations.

When a memory cell 6 is programmed or erased, power is consumed to
move its threshold from the erased state to the programmed state, or from
the programmed state to the erased state. The higher the voltage
separating the two states, the more power is consumed to change
the state of the cell.

With a voltage distance of about 2V between the erased and programmed
states and an accurate sense amplifier, good reliability and power consumption
performance can be obtained because the programming and erasing
algorithms converge rapidly.

In typical conditions, 64 bits programmed out of 128 bits, a programming
time of less than 16µs is obtained. Figure 4 is a real diagram showing the
worst-case program operation, when all 128 bits of a word are programmed.
As may be appreciated, it is completed in just one programming pulse
even though the gate programming voltage reaches only a relatively low value
(~6V), and a total programming time of ~18µs for 16 bytes is obtained.

The erasing function takes full advantage of the programming circuit,
which accelerates the soft-programming phase, and by using large parallelism
it is possible to have very short verify phases. A sector is typically erased
in 200ms.

The sense amplifier shown in Figure 5 is of a known type and is able to
operate down to 1.5V. This closed-loop circuit enhances precision and
current/voltage gain as needed to work with closer threshold margins. A
40ns access time is obtained, or a 400Mbyte/s read rate, which allows a 32-bit
processor to run at up to 100MHz.
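
The match between the module read rate and the processor clock can be checked
with a short calculation. The Python sketch below assumes that the processor
fetches one 32-bit word per clock cycle, which is an assumption made for
illustration; the function names are hypothetical.

```python
def read_rate_mbyte_per_s(bus_bits: int, access_time_ns: int) -> float:
    """Sustained read rate of one module delivering one word per access."""
    return (bus_bits // 8) * 1000 / access_time_ns


def fetch_demand_mbyte_per_s(word_bits: int, clock_mhz: int) -> float:
    """Bandwidth needed by a processor fetching one word per clock cycle (assumption)."""
    return float((word_bits // 8) * clock_mhz)


module_rate = read_rate_mbyte_per_s(bus_bits=128, access_time_ns=40)
cpu_demand = fetch_demand_mbyte_per_s(word_bits=32, clock_mhz=100)

print(f"module read rate: {module_rate} Mbyte/s")
print(f"processor demand: {cpu_demand} Mbyte/s")
```

Under this assumption, one 128-bit access every 40ns supplies exactly the four
32-bit words that a 100MHz processor consumes in the same interval.
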

The memory system 1 includes four 2Mbit flash memory modules 3 that
can each be requested to perform one of three operations (read, write, erase)
at the same time and independently. Simultaneous memory operations use
the power management arbiter block 15 (PMA) for optimal scheduling.

Available power and user-defined priorities are considered to schedule
conflicting resource requests in a single clock cycle.

Recall that the write operation is composed of two different basic
operations, program pulse and verify (a sequence that can be repeated),
while the erase operation is composed of three different basic operations,
erase pulse, verify (erase verify and depletion verify) and soft-program.
Each time a flash memory module enters a new basic operation, it sends a
request to the PMA arbiter block 15 for the allocation of all the needed
high voltages.

Read and verify are the only operations allowed to occur at the same time
in the four flash memory modules 3, while the basic program pulse and
basic erase pulse operations can be performed in just one memory module
3 at a time.
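
The concurrency rule just stated can be expressed as a simple check. The
Python sketch below is illustrative only. It covers the four basic operations
named in this paragraph and reads the rule as "each of program pulse and erase
pulse may be active in at most one module". Whether a program pulse in one
module may overlap an erase pulse in another is not stated in the text, so
that reading is an assumption. All identifiers are hypothetical.

```python
from collections import Counter

CONCURRENT_OPS = {"read", "verify"}               # may run in several modules at once
EXCLUSIVE_OPS = {"program_pulse", "erase_pulse"}  # at most one module each (assumed reading)


def schedule_is_allowed(ops_by_module: dict) -> bool:
    """Check a proposed set of simultaneous basic operations, one per module."""
    unknown = set(ops_by_module.values()) - CONCURRENT_OPS - EXCLUSIVE_OPS
    if unknown:
        raise ValueError(f"operation not covered by this sketch: {sorted(unknown)}")
    counts = Counter(ops_by_module.values())
    return all(counts[op] <= 1 for op in EXCLUSIVE_OPS)


mixed = schedule_is_allowed(
    {0: "read", 1: "read", 2: "verify", 3: "program_pulse"})
two_programs = schedule_is_allowed(
    {0: "program_pulse", 1: "program_pulse", 2: "read"})

print(mixed)         # reads and verifies alongside one program pulse
print(two_programs)  # two program pulses at the same time
```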

The main component of the PMA arbiter block 15, shown in Figure 6, is
the order block 19. It orders the requests of the memory modules 3 following
these rules:

- status of the request (already active or new request);

- priority information.

The requests are collected and processed in parallel by three stages:
encoders, comparators and one level of logic. The response is available in
only one clock cycle.
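
The ordering rules can be illustrated functionally in software. The Python
sketch below is not a model of the single-cycle hardware (encoders,
comparators and one level of logic); it only shows one possible resulting
order. It assumes that already active requests are served before new ones and
that a larger priority value means higher priority. Neither convention is
specified in the text, and the tie-break on module index is likewise an
assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    module: int     # index of the requesting flash memory module
    active: bool    # True if the request is already being served
    priority: int   # larger value = higher priority (assumption)


def order_requests(requests):
    """Return requests sorted by status first, then by priority."""
    return sorted(requests, key=lambda r: (not r.active, -r.priority, r.module))


pending = [
    Request(module=0, active=False, priority=3),
    Request(module=1, active=True, priority=1),
    Request(module=2, active=False, priority=5),
    Request(module=3, active=True, priority=2),
]
ordered_modules = [r.module for r in order_requests(pending)]
print(ordered_modules)
```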

Referring to Figure 6, a switch block 21 satisfies the requests of the flash
memory modules 3 based on the order block 19 output. A request decoder
block 20 (req_dec) is provided for enabling the required high voltage
resources (charge pumps 14), while a corresponding pump driver block 22
manages the power down/stand-by timeout and limits the requests for
each resource to the maximum allowed.

The correspondence between requests and high voltage resources, the power
down and standby time, and the maximum number of parallel requests
that can be satisfied are configurable.

Let us see in greater detail the function of the three ports 10, 11 and 12.

The first port 10 is dedicated to managing application code stored in flash
memory modules. It can also write in the memory areas
for DP, to perform memory formatting, and for FP, to store the
configurations to be downloaded.

The code port CP 10 has four configuration registers defining its
addressable memory space: two at the application level, and two at the
flash memory modules level.

The I/O data word bus is 64 bits wide, while the address bus is 32 bits.

The port uses one chip select to access the addressable memory space.
During operations (read and write), the port acts as a conventional RAM
memory, using a write enable in case of a write operation. As this port
allows the erase operation, which is necessary before a write operation in a
flash memory module, an erase enable input signal has been added.

During a read operation, an output ready signal is tied low when data are
not available immediately, so that it can act as a wait state signal.

The second port, the data port DP 11, is dedicated to managing application
data stored in flash memory modules, possibly organized in a file system by
the application, using a typical data page of 512B.

The DP has four configuration registers defining its addressable memory
space: two at the application level, and two at the flash memory modules
level.

The I/O data word bus is 64 bits wide, while the address bus is 32 bits. 

Functions are offered to give the application the possibility to implement a 
file system for data management. 

The operations available are Read Page, Read Word, Write Page, Invalid
Page and Defragmentation. A 512B SRAM page buffer allows the
application to exchange data in burst mode at maximum speed to increase
performance, especially during write operations.

The erase operation is not available because it is hidden by the
microcontroller 2, which performs a logical remapping of physical addresses.

Furthermore, in Write Page the physical address is chosen using an
algorithm that takes into account the filling status of sectors. If there are
full sectors with invalid pages, a defragmentation operation is automatically
started to increase free space, and sectors are eventually erased during
this operation.
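
The Write Page behaviour described above can be sketched as a toy model. The
Python code below is illustrative only and is not the algorithm run by the
microcontroller 2: the class, its methods, the first-fit placement policy and
the choice to buffer surviving pages while their sector is erased are all
assumptions. It shows the logical remapping, the invalidation of the old copy
and the automatic defragmentation of full sectors that hold invalid pages.

```python
FREE, VALID, INVALID = "free", "valid", "invalid"


class DataPortModel:
    """Toy page-remapping model; sizes and policies are hypothetical."""

    def __init__(self, sectors=2, pages_per_sector=4):
        self.state = [[FREE] * pages_per_sector for _ in range(sectors)]
        self.owner = [[None] * pages_per_sector for _ in range(sectors)]
        self.location = {}  # logical page -> (sector, page)

    def invalidate_page(self, logical):
        """Mark the current physical copy of a logical page as invalid."""
        loc = self.location.pop(logical, None)
        if loc is not None:
            s, p = loc
            self.state[s][p] = INVALID
            self.owner[s][p] = None

    def defragment(self):
        """Erase every full sector holding invalid pages, keeping valid data."""
        erased = 0
        for s, pages in enumerate(self.state):
            if FREE not in pages and INVALID in pages:
                survivors = [o for o in self.owner[s] if o is not None]
                self.state[s] = [FREE] * len(pages)    # sector erase
                self.owner[s] = [None] * len(pages)
                for p, logical in enumerate(survivors):  # rewrite buffered pages
                    self.state[s][p] = VALID
                    self.owner[s][p] = logical
                    self.location[logical] = (s, p)
                erased += 1
        return erased

    def write_page(self, logical):
        """Write a logical page to a fresh physical page and return its address."""
        self.invalidate_page(logical)
        if any(FREE not in pg and INVALID in pg for pg in self.state):
            self.defragment()
        for s, pages in enumerate(self.state):  # first fit (stand-in policy)
            if FREE in pages:
                p = pages.index(FREE)
                self.state[s][p] = VALID
                self.owner[s][p] = logical
                self.location[logical] = (s, p)
                return s, p
        raise MemoryError("no free page available")


dp = DataPortModel(sectors=1, pages_per_sector=2)
first_a = dp.write_page("a")
first_b = dp.write_page("b")
second_a = dp.write_page("a")  # rewriting "a" fills the sector and triggers defragmentation
print(first_a, first_b, second_a, dp.location["b"])
```

In this small run the rewrite of page "a" leaves the only sector full with one
invalid page, so the model erases it, moves "b" to the first page and places
the new copy of "a" after it. The application never issues an erase itself.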

A Port Status Register is available and can be directly read in order to get 
information about the status of current operations. 

The third port, the FP port 12, is dedicated to managing embedded-FPGA
(e-FPGA) configuration data stored in flash memory modules. The FP port is
read-only and provides fast sequential access for bit stream downloading.

The FP has four configuration registers replicating the information stored
in the CP port, which must be used in order to write e-FPGA configuration data.

The output data word bus and the address bus are 32 bits wide. The FP
port uses a chip select to access the addressable memory space, and a
burst enable to allow burst serial access.

In a read operation, an output ready signal is tied low when data are not
immediately available, so that it can act as a wait state signal.

The eight-bit microprocessor 2 (uP) performs additional complex functions
(defragmentation, compression, virtual erase, etc.) not natively supported
by the DP port 11, and assists in the built-in self-test of the memory system.

The (N+2)x4 128-bit crossbar 15 connects the modular memory with the
four initiators (CP, DP, FP and uP), so that three flash memory
modules 3 can be read in parallel at full speed.

The memory space of the four modules 3 is arranged in three
programmable user-defined partitions, each one devoted to a port. The
memory system clock can run up to 100MHz, and reading three modules
3 with a 128-bit data bus and 40ns access time results in a peak read
throughput of 1.2GB/s.
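
The peak figure follows directly from the parameters quoted above, as the
following Python sketch shows. The function names are hypothetical; the
second function only converts the access time into clock cycles at the
maximum system clock.

```python
def peak_read_gbyte_per_s(modules: int, bus_bits: int, access_time_ns: int) -> float:
    """Aggregate read rate of several modules read in parallel (bytes per ns = GB/s)."""
    return modules * (bus_bits // 8) / access_time_ns


def access_time_in_cycles(access_time_ns: int, clock_mhz: int) -> float:
    """Number of system clock cycles spanned by one memory access."""
    return access_time_ns * clock_mhz / 1000


peak = peak_read_gbyte_per_s(modules=3, bus_bits=128, access_time_ns=40)
cycles = access_time_in_cycles(access_time_ns=40, clock_mhz=100)

print(f"peak read throughput: {peak} GB/s")
print(f"access time at 100MHz: {cycles} clock cycles")
```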

The overall system testability is enhanced by the specific DFT block 16
connected to all relevant internal signals. It makes use of an external high
voltage power supply, while access from the external test equipment is
granted by two analog IO pads.

By means of external analog references, the DFT block 16 can first test its
own circuitry and then all internally generated voltages and currents that
are vital for correct system operation (e.g. band gap voltage, regulated
voltages, charge pumps).

The measurement capability of the component can be profitably applied to
the trimming of analog internal signals, so that the following
operations can also be implemented:

- reference flash cell current measurement and calibration;

- voltage and current reference calibration;

- threshold voltage (as obtained by an analog sense amplifier)
measurement of memory cells.

See for example the article by P. L. Rolandi et al., "1M-cell 6b/cell analog
flash memory for digital storage", ISSCC 1998 Digest of Technical Papers,
pp. 334-335, Feb. 1998.

The test flow is controlled by the microprocessor 2 present in the system.

The main components of the DFT block 16 are a network of analog
switches, multiplexers, a charge integrator, a voltage attenuator, a
comparator and a ten-bit pipeline ADC (Analog-to-Digital Converter).

The main function of the two analog IO pads is to provide external
references for the measurements. By means of the analog switch network,
they also allow wide direct observability of internal nodes
under test in the system.

The voltage path to the ADC is fully differential, yielding advantages in 
terms of power supply noise rejection. 

Table I below summarizes the technology parameters and device
performance of the inventive memory architecture, while Figure 7
shows a picture of the test chip, which has been designed using a NOR-type
0.18µm embedded flash technology with 1.8V power supply, two poly, six
metal and a memory cell size of 0.35µm². The test chip size is 8.4x4.8 mm².

TABLE I

TECHNOLOGY AND DEVICE PARAMETERS

Process                  0.18µm CMOS, two poly, six metal
Tunneling oxide          10nm
Cell size                0.35µm²
Organization             Four modules x 256Kb x nine sectors
Memory module word       128 bits
Supply voltage           1.6V - 2.0V
Program throughput       1MB/s
Sector erasing time      200ms
Access time              40ns
Peak read throughput     1.2GB/s

From the previous description it may be appreciated that the memory
architecture has an overall size of 8Mb of application-specific embeddable
flash memory cells and comprises three content-specific I/O ports that can
deliver a peak read throughput of 1.2GB/s for non-volatile storage of code,
data and embedded FPGA bit stream configurations.

CLAIMS

1. An embeddable Flash memory system (1) for non-volatile storage of
code, data and bit-streams for embedded FPGA configurations, the
system being integrated into a single chip together with a
microprocessor (2) and including a modular array structure (13)
comprising N memory blocks (3), wherein a power block (14), including
charge pumps, is shared among different flash memory modules (3)
through a PMA arbiter (15) in a multi-bank fashion.

2. An embeddable Flash memory system according to claim 1, including
three different access ports, each for a specific function:

- a code port CP (10) optimized for random access time and the
application system;

- a data port DP (11) allowing an easy way to access and modify
application data; and,

- an FPGA port FP (12) offering a serial access for a fast download of bit
streams for embedded FPGA (e-FPGA) configurations.

3. An embeddable Flash memory system according to claim 1, wherein
said PMA arbiter block (15) includes an order block (19) ordering the
memory modules (3) requests following these rules:

- status of the request (already active or new request);

- priority information.

4. An embeddable Flash memory system according to claim 3, further
including a switch block (21) managing the flash memory modules (3)
requests based on the output of said order block (19); a request
decoder block (20) being provided for enabling the required high voltage
resources while a corresponding pump driver block (22) manages the
power down/stand-by timeout and limits the requests for each
resource to the maximum allowed.

5. An embeddable Flash memory system according to claim 1, wherein
said code port CP (10) comprises four configuration registers defining
its addressable memory space: two at the application level, and two at
the flash memory modules level.

6. An embeddable Flash memory system according to claim 1, wherein
said second data port DP (11) manages application data stored in flash
memory modules (3) using an SRAM page buffer that allows the application
to exchange data in burst mode at maximum speed to increase
performance during write operation.

7. An embeddable Flash memory system according to claim 5, wherein
said third FP port (12) comprises four configuration registers
replicating the information stored in said code port CP (10) that must
be used in order to write e-FPGA configurations data.

8. An embeddable Flash memory system according to claim 1, wherein
said FP port (12) uses a chip select to access the addressable
memory space and a burst enable to allow burst serial access.

9. An embeddable Flash memory system according to claim 1, wherein a
DFT block (16) is provided and connected to all relevant internal
signals for first internal testing and then testing of all internally
generated voltages and currents; said DFT block (16) making
use of an external high voltage power supply, while access from the
external test equipment is granted by two analog IO pads.

ABSTRACT

The present invention relates to an 8Mb application-specific embeddable
flash memory. It comprises three content-specific I/O ports and delivers a
peak read throughput of 1.2GB/s. The memory is combined with a special
automatic programming gate voltage ramp generator circuit that allows a
programming rate of 1 Mbyte/s for non-volatile storage of code, data and
embedded FPGA bit stream configurations. The test chip has been
designed using a NOR-type 0.18µm embedded flash technology with 1.8V
power supply, two poly, six metal and a memory cell size of 0.35µm².

(Fig. 1)